{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "cf89fbaa",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/output_parsing/df_program.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "530c973e-916d-4c9e-9365-e2d5306d7e3d",
   "metadata": {},
   "source": [
    "# DataFrame Structured Data Extraction"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "18461ba1-6978-4b5b-861e-6dceec36857b",
   "metadata": {},
   "source": [
    "This demo shows how you can extract tabular DataFrames from raw text.\n",
    "\n",
    "This was directly inspired by jxnl's dataframe example here: https://github.com/jxnl/openai_function_call/blob/main/auto_dataframe.py.\n",
    "\n",
    "We show this with different levels of complexity, all backed by the OpenAI Function API:\n",
    "- (more code) How to build an extractor yourself using our OpenAIPydanticProgram\n",
    "- (less code) Using our out-of-the-box `DFFullProgram` and `DFRowsProgram` objects\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "fb240ba9-2f23-4686-8097-4f1d7bdf02cb",
   "metadata": {},
   "source": [
    "## Build a DF Extractor Yourself (Using OpenAIPydanticProgram)\n",
    "\n",
    "Our OpenAIPydanticProgram is a wrapper around an OpenAI LLM that supports function calling - it will return structured\n",
    "outputs in the form of a Pydantic object.\n",
    "\n",
    "We import our `DataFrame` and `DataFrameRowsOnly` objects.\n",
    "\n",
    "To create an output extractor, you just need to 1) specify the relevant Pydantic object, and 2) Add the right prompt"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "bb74e24d",
   "metadata": {},
   "source": [
    "If you're opening this Notebook on colab, you will probably need to install LlamaIndex 🦙."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bc357ded",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openai\n",
    "%pip install llama-index-program-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "101e1779",
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c3f2918-7776-4b59-88da-07299cda4f25",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.program.openai import OpenAIPydanticProgram\n",
    "from llama_index.core.program import (\n",
    "    DFFullProgram,\n",
    "    DataFrame,\n",
    "    DataFrameRowsOnly,\n",
    ")\n",
    "from llama_index.llms.openai import OpenAI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bcb69ed8-7b02-4fac-b6b6-b09c301fc2cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "program = OpenAIPydanticProgram.from_defaults(\n",
    "    output_cls=DataFrame,\n",
    "    llm=OpenAI(temperature=0, model=\"gpt-4-0613\"),\n",
    "    prompt_template_str=(\n",
    "        \"Please extract the following query into a structured data according\"\n",
    "        \" to: {input_str}.Please extract both the set of column names and a\"\n",
    "        \" set of rows.\"\n",
    "    ),\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "420c26b2-f652-436c-a661-d594c56496c8",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Function call: DataFrame with args: {\n",
      "  \"columns\": [\n",
      "    {\n",
      "      \"column_name\": \"Name\",\n",
      "      \"column_desc\": \"Name of the person\"\n",
      "    },\n",
      "    {\n",
      "      \"column_name\": \"Age\",\n",
      "      \"column_desc\": \"Age of the person\"\n",
      "    },\n",
      "    {\n",
      "      \"column_name\": \"City\",\n",
      "      \"column_desc\": \"City where the person lives\"\n",
      "    },\n",
      "    {\n",
      "      \"column_name\": \"Hobby\",\n",
      "      \"column_desc\": \"What the person likes to do\"\n",
      "    }\n",
      "  ],\n",
      "  \"rows\": [\n",
      "    {\n",
      "      \"row_values\": [\"John\", 25, \"New York\", \"play basketball\"]\n",
      "    },\n",
      "    {\n",
      "      \"row_values\": [\"Mike\", 30, \"San Francisco\", \"play baseball\"]\n",
      "    },\n",
      "    {\n",
      "      \"row_values\": [\"Sarah\", 20, \"Los Angeles\", \"play tennis\"]\n",
      "    },\n",
      "    {\n",
      "      \"row_values\": [\"Mary\", 35, \"Chicago\", \"play tennis\"]\n",
      "    }\n",
      "  ]\n",
      "}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "DataFrame(description=None, columns=[DataFrameColumn(column_name='Name', column_desc='Name of the person'), DataFrameColumn(column_name='Age', column_desc='Age of the person'), DataFrameColumn(column_name='City', column_desc='City where the person lives'), DataFrameColumn(column_name='Hobby', column_desc='What the person likes to do')], rows=[DataFrameRow(row_values=['John', 25, 'New York', 'play basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'play baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'play tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', 'play tennis'])])"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# NOTE: the test example is taken from jxnl's repo\n",
    "\n",
    "response_obj = program(\n",
    "    input_str=\"\"\"My name is John and I am 25 years old. I live in \n",
    "        New York and I like to play basketball. His name is \n",
    "        Mike and he is 30 years old. He lives in San Francisco \n",
    "        and he likes to play baseball. Sarah is 20 years old \n",
    "        and she lives in Los Angeles. She likes to play tennis.\n",
    "        Her name is Mary and she is 35 years old. \n",
    "        She lives in Chicago.\"\"\"\n",
    ")\n",
    "response_obj"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fe28dc6d-98f9-4530-a5c4-255ae68d6d9c",
   "metadata": {},
   "outputs": [],
   "source": [
    "program = OpenAIPydanticProgram.from_defaults(\n",
    "    output_cls=DataFrameRowsOnly,\n",
    "    llm=OpenAI(temperature=0, model=\"gpt-4-0613\"),\n",
    "    prompt_template_str=(\n",
    "        \"Please extract the following text into a structured data:\"\n",
    "        \" {input_str}. The column names are the following: ['Name', 'Age',\"\n",
    "        \" 'City', 'Favorite Sport']. Do not specify additional parameters that\"\n",
    "        \" are not in the function schema. \"\n",
    "    ),\n",
    "    verbose=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4ac6260c-c06a-48e6-9f7e-1892dd988eac",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Function call: DataFrameRowsOnly with args: {\n",
      "  \"rows\": [\n",
      "    {\n",
      "      \"row_values\": [\"John\", 25, \"New York\", \"basketball\"]\n",
      "    },\n",
      "    {\n",
      "      \"row_values\": [\"Mike\", 30, \"San Francisco\", \"baseball\"]\n",
      "    },\n",
      "    {\n",
      "      \"row_values\": [\"Sarah\", 20, \"Los Angeles\", \"tennis\"]\n",
      "    },\n",
      "    {\n",
      "      \"row_values\": [\"Mary\", 35, \"Chicago\", \"\"]\n",
      "    }\n",
      "  ]\n",
      "}\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "DataFrameRowsOnly(rows=[DataFrameRow(row_values=['John', 25, 'New York', 'basketball']), DataFrameRow(row_values=['Mike', 30, 'San Francisco', 'baseball']), DataFrameRow(row_values=['Sarah', 20, 'Los Angeles', 'tennis']), DataFrameRow(row_values=['Mary', 35, 'Chicago', ''])])"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "program(\n",
    "    input_str=\"\"\"My name is John and I am 25 years old. I live in \n",
    "        New York and I like to play basketball. His name is \n",
    "        Mike and he is 30 years old. He lives in San Francisco \n",
    "        and he likes to play baseball. Sarah is 20 years old \n",
    "        and she lives in Los Angeles. She likes to play tennis.\n",
    "        Her name is Mary and she is 35 years old. \n",
    "        She lives in Chicago.\"\"\"\n",
    ")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "acbcbca8-78f1-47cd-9507-81b2c78ba6fe",
   "metadata": {},
   "source": [
    "## Use our DataFrame Programs\n",
    "\n",
    "We provide convenience wrappers for `DFFullProgram` and `DFRowsProgram`. This allows a simpler object creation interface than specifying all details through the `OpenAIPydanticProgram`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e9b4c40f-5426-4b6f-9b97-b612c63f772d",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.program.openai import OpenAIPydanticProgram\n",
    "from llama_index.core.program import DFFullProgram, DFRowsProgram\n",
    "import pandas as pd\n",
    "\n",
    "# initialize empty df\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"Name\": pd.Series(dtype=\"str\"),\n",
    "        \"Age\": pd.Series(dtype=\"int\"),\n",
    "        \"City\": pd.Series(dtype=\"str\"),\n",
    "        \"Favorite Sport\": pd.Series(dtype=\"str\"),\n",
    "    }\n",
    ")\n",
    "\n",
    "# initialize program, using existing df as schema\n",
    "df_rows_program = DFRowsProgram.from_defaults(\n",
    "    pydantic_program_cls=OpenAIPydanticProgram, df=df\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d7eeba46-25bd-499b-a7c4-d7f475c4de21",
   "metadata": {},
   "outputs": [],
   "source": [
    "# parse text, using existing df as schema\n",
    "result_obj = df_rows_program(\n",
    "    input_str=\"\"\"My name is John and I am 25 years old. I live in \n",
    "        New York and I like to play basketball. His name is \n",
    "        Mike and he is 30 years old. He lives in San Francisco \n",
    "        and he likes to play baseball. Sarah is 20 years old \n",
    "        and she lives in Los Angeles. She likes to play tennis.\n",
    "        Her name is Mary and she is 35 years old. \n",
    "        She lives in Chicago.\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "455a96c0-b4dd-46cc-a8f0-b70241733e2e",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/gpt_index/llama_index/program/predefined/df.py:65: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.\n",
      "  return existing_df.append(new_df, ignore_index=True)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>City</th>\n",
       "      <th>Favorite Sport</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>John</td>\n",
       "      <td>25</td>\n",
       "      <td>New York</td>\n",
       "      <td>Basketball</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Mike</td>\n",
       "      <td>30</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>Baseball</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Sarah</td>\n",
       "      <td>20</td>\n",
       "      <td>Los Angeles</td>\n",
       "      <td>Tennis</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mary</td>\n",
       "      <td>35</td>\n",
       "      <td>Chicago</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Name  Age           City Favorite Sport\n",
       "0   John   25       New York     Basketball\n",
       "1   Mike   30  San Francisco       Baseball\n",
       "2  Sarah   20    Los Angeles         Tennis\n",
       "3   Mary   35        Chicago               "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_obj.to_df(existing_df=df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22ff7819-51be-4e84-a537-6e5b32aa4e4d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize program that can do joint schema extraction and structured data extraction\n",
    "df_full_program = DFFullProgram.from_defaults(\n",
    "    pydantic_program_cls=OpenAIPydanticProgram,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dda44cc0-a3f4-4ba8-9304-a6f511722e07",
   "metadata": {},
   "outputs": [],
   "source": [
    "result_obj = df_full_program(\n",
    "    input_str=\"\"\"My name is John and I am 25 years old. I live in \n",
    "        New York and I like to play basketball. His name is \n",
    "        Mike and he is 30 years old. He lives in San Francisco \n",
    "        and he likes to play baseball. Sarah is 20 years old \n",
    "        and she lives in Los Angeles. She likes to play tennis.\n",
    "        Her name is Mary and she is 35 years old. \n",
    "        She lives in Chicago.\"\"\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40067e73-2a28-45f6-a5f5-c54004bef4ad",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Name</th>\n",
       "      <th>Age</th>\n",
       "      <th>Location</th>\n",
       "      <th>Hobby</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>John</td>\n",
       "      <td>25</td>\n",
       "      <td>New York</td>\n",
       "      <td>Basketball</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Mike</td>\n",
       "      <td>30</td>\n",
       "      <td>San Francisco</td>\n",
       "      <td>Baseball</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Sarah</td>\n",
       "      <td>20</td>\n",
       "      <td>Los Angeles</td>\n",
       "      <td>Tennis</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Mary</td>\n",
       "      <td>35</td>\n",
       "      <td>Chicago</td>\n",
       "      <td></td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    Name  Age       Location       Hobby\n",
       "0   John   25       New York  Basketball\n",
       "1   Mike   30  San Francisco    Baseball\n",
       "2  Sarah   20    Los Angeles      Tennis\n",
       "3   Mary   35        Chicago            "
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "result_obj.to_df()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e4420b92-0da4-4b1e-8f84-8936c3c81246",
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize empty df\n",
    "df = pd.DataFrame(\n",
    "    {\n",
    "        \"City\": pd.Series(dtype=\"str\"),\n",
    "        \"State\": pd.Series(dtype=\"str\"),\n",
    "        \"Population\": pd.Series(dtype=\"int\"),\n",
    "    }\n",
    ")\n",
    "\n",
    "# initialize program, using existing df as schema\n",
    "df_rows_program = DFRowsProgram.from_defaults(\n",
    "    pydantic_program_cls=OpenAIPydanticProgram, df=df\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7bd7c9ed-2a40-4efd-b09a-44c95bd302fd",
   "metadata": {},
   "outputs": [],
   "source": [
    "input_text = \"\"\"San Francisco is in California, has a population of 800,000. \n",
    "New York City is the most populous city in the United States. \\\n",
    "With a 2020 population of 8,804,190 distributed over 300.46 square miles (778.2 km2), \\\n",
    "New York City is the most densely populated major city in the United States.\n",
    "New York City is in New York State.\n",
    "Boston (US: /ˈbɔːstən/),[8] officially the City of Boston, is the capital and largest city of the Commonwealth of Massachusetts \\\n",
    "and the cultural and financial center of the New England region of the Northeastern United States. \\\n",
    "The city boundaries encompass an area of about 48.4 sq mi (125 km2)[9] and a population of 675,647 as of 2020.[4]\n",
    "\"\"\"\n",
    "\n",
    "# parse text, using existing df as schema\n",
    "result_obj = df_rows_program(input_str=input_text)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16fa5b01-e12c-4151-b554-c7de470b4255",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/jerryliu/Programming/gpt_index/llama_index/program/predefined/df.py:65: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.\n",
      "  return existing_df.append(new_df, ignore_index=True)\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>City</th>\n",
       "      <th>State</th>\n",
       "      <th>Population</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>San Francisco</td>\n",
       "      <td>California</td>\n",
       "      <td>800000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>New York City</td>\n",
       "      <td>New York</td>\n",
       "      <td>8804190</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Boston</td>\n",
       "      <td>Massachusetts</td>\n",
       "      <td>675647</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            City          State  Population\n",
       "0  San Francisco     California      800000\n",
       "1  New York City       New York     8804190\n",
       "2         Boston  Massachusetts      675647"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "new_df = result_obj.to_df(existing_df=df)\n",
    "new_df"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
